setwd("/Users/tarineccleston/Documents/Masters/STATS 762/regression-for-DS/frog-noises")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.0 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.0
## âś” ggplot2 3.4.1 âś” tibble 3.1.8
## âś” lubridate 1.9.2 âś” tidyr 1.3.0
## âś” purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(MuMIn)
frog_df = read.csv("data/frogs.csv")
# find the absolute distance between the frog and each microphone
frog_df$distance = sqrt((abs(frog_df$mic.x - frog_df$animal.x)^2) + (abs(frog_df$mic.y - frog_df$animal.y)^2))
# get the probability of that particular microphone detected that particular frog
# based on the number of calls the frog makes and the detected amount from the mic
frog_df$prob = frog_df$detected / frog_df$n.calls
pairs(frog_df[,c(1,7,8,9,10)])
# fit model
frog_fit <- glm(cbind(detected, n.calls - detected) ~ distance, family = "binomial", data = frog_df)
summary(frog_fit)
##
## Call:
## glm(formula = cbind(detected, n.calls - detected) ~ distance,
## family = "binomial", data = frog_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.0706 -0.7967 -0.2380 0.7482 3.3827
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.39367 0.22121 15.34 <2e-16 ***
## distance -0.79194 0.04199 -18.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1311.15 on 181 degrees of freedom
## Residual deviance: 496.14 on 180 degrees of freedom
## AIC: 625.13
##
## Number of Fisher Scoring iterations: 5
xx <- seq(min(frog_df$distance), max(frog_df$distance), length.out = 1000)
yy <- predict(frog_fit, newdata = data.frame(distance = xx), type = "response")
{plot(frog_df$distance, frog_df$prob, col = c(1:7)[frog_df$mic.id], main = "Distance of Frog from Mic vs Probability of Noise Detection", xlab = "Distance of Frog from Mic (m)", ylab = "Probability of Noise Detection")
legend('topright', legend = c("1","2","3","4","5","6","7"), col = c(1:7), pch = 1)
lines(xx, yy)}
Looking at the data, it appears that the technical mishaps appear to be for some observations of frog noises recorded 0 - 4 meters away from the frog, where the probability of the call detection of the frogs is 0. This is unusual in comparison to other similar distance values in the data as those observations have a probability of detecting calls close to 1. We will do further analysis on this…
plot(frog_fit, which=1:6)
The residuals vs fitted shows a non constant variance in a data, is worth mentioning but shouldn’t be an issue as we are making an inference, not point predictions.
The QQ plot shows a non-normal error distribution with a symmetric short tail. We can ignore this as in most cases, the sampling distribution of the betas will be adequately approximated by the normal distribution.
As shown from the Cook’s distance plot, observations 24, 157, and 143 appear to be influential as they’re above 1.
As shown in the Residuals vs Leverage plot, observations 24, 157, and 143 are located approximately on the bottom right hand corner. These points appear to be both moderately leveraged and outlier (high residual) observations making them influential points.
Let’s take a look at which mic these observations were taken from…
mic_obs_24 = frog_df[24,]$mic.id
mic_obs_143 = frog_df[143,]$mic.id
mic_obs_157 = frog_df[157,]$mic.id
influential_mic_obs = cbind(mic_obs_24, mic_obs_143, mic_obs_157)
rownames(influential_mic_obs) = 'mic.id'
influential_mic_obs
## mic_obs_24 mic_obs_143 mic_obs_157
## mic.id 3 3 3
All influential observations appear to come from mic 3. Let’s look at all observations from mic 3
mic_3 = frog_df %>%
filter(mic.id == 3) %>%
select(mic.id, animal.id, distance, detected, n.calls)
mic_3
## mic.id animal.id distance detected n.calls
## 1 3 1 11.7069424 0 8
## 2 3 2 9.8564294 0 6
## 3 3 3 14.6273887 0 10
## 4 3 4 0.6139218 0 11
## 5 3 5 7.6147291 0 7
## 6 3 6 4.7095329 0 8
## 7 3 7 1.5134398 0 7
## 8 3 8 5.1439479 0 8
## 9 3 9 6.0367955 0 13
## 10 3 10 2.8901211 0 10
## 11 3 11 11.1196043 0 12
## 12 3 12 9.5189285 0 10
## 13 3 13 8.2751918 0 9
## 14 3 14 6.4051620 0 15
## 15 3 15 8.1884919 0 11
## 16 3 16 4.7288265 0 7
## 17 3 17 3.3890412 0 7
## 18 3 18 11.7734617 0 11
## 19 3 19 8.3303061 0 8
## 20 3 20 11.3682408 0 3
## 21 3 21 0.8549269 0 8
## 22 3 22 3.6928715 0 8
## 23 3 23 0.7800641 0 8
## 24 3 24 8.8960272 0 9
## 25 3 25 13.7860836 0 10
## 26 3 26 7.3762118 0 11
Mic 3 fails to detect all frog calls for a range of distances from the mic to the frog This is unexpected, as other mics are able to pickup calls when the frog is close to the the mic. We can assume that observations from mic 3 are untrustworthy as the mic is faulty or didn’t record properly.
Remove all observations which were recorded from mic 3. Since mic 3 is assumed to give incorrect measurements. We assume that the residuals are normally distributed in regression. If some of the measurements are wrong, it can violate this assumption and lead to biased and inefficient estimates of the regression coefficients.
# remove mic 3 observations
frog_corrected_df = frog_df %>%
filter(mic.id != 3)
frog_corrected_df$mic.id = as.factor(frog_corrected_df$mic.id)
# fit model again and show improvement in fit
frog_corrected_fit <- glm(cbind(detected, n.calls - detected) ~ distance, family = "binomial", data = frog_corrected_df)
summary(frog_corrected_fit)
##
## Call:
## glm(formula = cbind(detected, n.calls - detected) ~ distance,
## family = "binomial", data = frog_corrected_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8896 -0.4712 -0.1244 0.3688 2.4608
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.42260 0.39776 16.15 <2e-16 ***
## distance -1.27960 0.07419 -17.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1169.31 on 155 degrees of freedom
## Residual deviance: 116.93 on 154 degrees of freedom
## AIC: 245.92
##
## Number of Fisher Scoring iterations: 6
xx <- seq(min(frog_corrected_df$distance), max(frog_corrected_df$distance), length.out = 1000)
yy <- predict(frog_corrected_fit, newdata = data.frame(distance = xx), type = "response")
{plot(frog_corrected_df$distance, frog_corrected_df$prob, col = c(1:7)[frog_corrected_df$mic.id], main = "Distance of Frog from Mic vs Probability of Noise Detection", xlab = "Distance of Frog from Mic (m)", ylab = "Probability of Noise Detection")
legend('topright', legend = c("1","2","3","4","5","6","7"), col = c(1:7), pch = 1)
lines(xx, yy)}
plot(frog_corrected_fit, which=1:6)
New model with mic 3 data removed appears to still have non constant variance on the residual vs fitted graph.
The QQ plot shows greater normality after removing observations from mic 3, however the plot still shows a non-normal error distribution with a symmetric short tail. We can ignore this as in most cases, the sampling distribution of the betas will be adequately approximated by the normal distribution.
As shown from the Cook’s distance plot, observations 23, 37, and 58 appear to be most influential, however they’re much below 1.
As shown in the Residuals vs Leverage plot, observations 23, 37, and 58 all appear to be somewhat outliers with a high residual, making them somewhat influencial. Let’s investigate further…
mic_obs_23 = frog_corrected_df[23,]$mic.id
mic_obs_37 = frog_corrected_df[37,]$mic.id
mic_obs_58 = frog_corrected_df[58,]$mic.id
influential_mic_obs = cbind(mic_obs_23, mic_obs_37, mic_obs_58)
rownames(influential_mic_obs) = 'mic.id'
influential_mic_obs
## mic_obs_23 mic_obs_37 mic_obs_58
## mic.id 5 1 4
All observations appear to come from random mics. We can infer that these influential points are due to chance, unlike measurements from mic 3 which were untrustworthy as mic 3 was faulty. We keep these points in our model.
Now lets try explaining the probability based on distance, animal ID and mic ID. We will try all possible models using dredge with AIC penalty.
# remove multicollinear variables such as mic and frog coordinates used
# to calculate absolute distance
frog_corrected_2_df = frog_corrected_df %>%
select(animal.id, mic.id, distance, detected, n.calls)
# treat each mic and animal as a factor so we can compare between categories
frog_corrected_2_df$mic.id = as.factor(frog_corrected_2_df$mic.id)
frog_corrected_2_df$animal.id = as.factor(frog_corrected_2_df$animal.id)
# use mic.id and animal.id as explanatory variables
# use exhaustive search methods (AIC) so we don't miss anything
frog_corrected_2_fit <- glm(cbind(detected, n.calls - detected) ~ ., family = "binomial", data = frog_corrected_2_df)
summary(frog_corrected_2_fit)
##
## Call:
## glm(formula = cbind(detected, n.calls - detected) ~ ., family = "binomial",
## data = frog_corrected_2_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.52259 -0.31049 -0.04858 0.22965 1.73976
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.22990 0.73856 7.081 1.43e-12 ***
## animal.id2 0.18650 0.93508 0.199 0.84191
## animal.id3 2.63999 0.90051 2.932 0.00337 **
## animal.id4 2.04137 0.83248 2.452 0.01420 *
## animal.id5 1.93483 1.39153 1.390 0.16440
## animal.id6 2.14617 1.03919 2.065 0.03890 *
## animal.id7 0.83507 0.83466 1.000 0.31707
## animal.id8 2.23798 0.94520 2.368 0.01790 *
## animal.id9 1.89275 0.94623 2.000 0.04547 *
## animal.id10 2.61933 0.90640 2.890 0.00385 **
## animal.id11 0.38947 0.84742 0.460 0.64581
## animal.id12 0.37295 0.86248 0.432 0.66544
## animal.id13 1.28735 0.78392 1.642 0.10055
## animal.id14 1.44982 0.83959 1.727 0.08420 .
## animal.id15 0.04522 0.83540 0.054 0.95683
## animal.id16 0.85527 1.03143 0.829 0.40698
## animal.id17 0.71706 0.80220 0.894 0.37139
## animal.id18 1.54013 0.83454 1.845 0.06497 .
## animal.id19 1.62449 0.82327 1.973 0.04847 *
## animal.id20 -0.61420 1.32760 -0.463 0.64362
## animal.id21 0.81488 0.88427 0.922 0.35677
## animal.id22 1.36915 0.91016 1.504 0.13250
## animal.id23 1.57732 0.86070 1.833 0.06686 .
## animal.id24 1.30168 0.86944 1.497 0.13435
## animal.id25 1.35576 0.95353 1.422 0.15507
## animal.id26 1.45908 0.81834 1.783 0.07459 .
## mic.id2 0.45368 0.49911 0.909 0.36337
## mic.id4 1.50217 0.52072 2.885 0.00392 **
## mic.id5 2.11318 0.43331 4.877 1.08e-06 ***
## mic.id6 2.03751 0.50797 4.011 6.04e-05 ***
## mic.id7 1.16900 0.50746 2.304 0.02124 *
## distance -1.54529 0.10681 -14.468 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1169.307 on 155 degrees of freedom
## Residual deviance: 53.418 on 124 degrees of freedom
## AIC: 242.41
##
## Number of Fisher Scoring iterations: 7
Distance and mic ID seem the most statistically significant compared to animal ID from the summary table above.
options(na.action = "na.fail")
all_fit = dredge(frog_corrected_2_fit)
## Fixed term is "(Intercept)"
head(all_fit)
## Global model call: glm(formula = cbind(detected, n.calls - detected) ~ ., family = "binomial",
## data = frog_corrected_2_df)
## ---
## Model selection table
## (Int) anm.id dst mic.id df logLik AICc delta weight
## 7 5.944 -1.380 + 7 -107.067 228.9 0.00 1
## 3 6.423 -1.280 2 -120.961 246.0 17.11 0
## 8 5.230 + -1.545 + 32 -89.204 259.6 30.69 0
## 4 6.177 + -1.359 27 -106.313 278.4 49.55 0
## 2 -1.099 + 26 -539.998 1142.9 913.99 0
## 6 -1.362 + + 31 -537.139 1152.3 923.39 0
## Models ranked by AICc(x)
plot(all_fit,col="blue")
Models 7, with 7 degrees of freedom seems to perform the best with the lowest AIC. This takes mic 1 as the baseline, and other mic coefficients in comparison with mic 1 and the respective mic. Model 3 follows behind it, with using only distance to explain probability of detecting a call.
best_model = get.models(all_fit, 1)[[1]]
summary(best_model)
##
## Call:
## glm(formula = cbind(detected, n.calls - detected) ~ distance +
## mic.id + 1, family = "binomial", data = frog_corrected_2_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.94526 -0.42030 -0.07574 0.32278 2.40503
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.94446 0.42902 13.856 < 2e-16 ***
## distance -1.38031 0.08416 -16.402 < 2e-16 ***
## mic.id2 0.63770 0.36654 1.740 0.08190 .
## mic.id4 1.30527 0.38889 3.356 0.00079 ***
## mic.id5 1.68061 0.34895 4.816 1.46e-06 ***
## mic.id6 1.27312 0.39109 3.255 0.00113 **
## mic.id7 0.87515 0.38198 2.291 0.02196 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1169.307 on 155 degrees of freedom
## Residual deviance: 89.145 on 149 degrees of freedom
## AIC: 228.13
##
## Number of Fisher Scoring iterations: 7
Frogs all produce calls at about the same volume. We wouldn’t expect some frogs to be more detectable than others based on the characteristics of its call, this explains why we a relatively high p-value for the animal.id factors, and why models which used animal-id in dredge were heavily penalised and are not used in our final model.
As the low p-values suggest for each of the mic.id coefficients, there is a clear difference in frog noise measurement performance between each mic and the baseline (mic 1). This could be due to some microphones being better at detecting calls than others: one placed inside a bush might be less sensitive to frog noises compared to one placed out in the open.
frog_final_fit = glm(cbind(detected, n.calls - detected) ~ distance + mic.id, family = "binomial", data = frog_corrected_df)
xx = seq(min(frog_corrected_df$distance), max(frog_corrected_df$distance), length.out = 1000)
mic_1 = predict(frog_final_fit, newdata = data.frame(distance = xx, mic.id = "1"), type = "response")
mic_2 = predict(frog_final_fit, newdata = data.frame(distance = xx, mic.id = "2"), type = "response")
mic_4 = predict(frog_final_fit, newdata = data.frame(distance = xx, mic.id = "4"), type = "response")
mic_5 = predict(frog_final_fit, newdata = data.frame(distance = xx, mic.id = "5"), type = "response")
mic_6 = predict(frog_final_fit, newdata = data.frame(distance = xx, mic.id = "6"), type = "response")
mic_7 = predict(frog_final_fit, newdata = data.frame(distance = xx, mic.id = "7"), type = "response")
{plot(frog_corrected_df$distance, frog_corrected_df$prob, col = c(1:2,4:7)[frog_corrected_df$mic.id], main = "Distance of Frog from Mic vs Probability of Noise Detection", xlab = "Distance of Frog from Mic (m)", ylab = "Probability of Noise Detection")
lines(xx, mic_1, col = 1)
lines(xx, mic_2, col = 2)
lines(xx, mic_4, col = 4)
lines(xx, mic_5, col = 5)
lines(xx, mic_6, col = 6)
lines(xx, mic_7, col = 7)
legend('topright', legend = c("mic 1","mic 2","mic 4","mic 5","mic 6","mic7"), col = c(1:2, 4:7), pch = 1)}
Furthermore we can see a clear difference of detection performance between each mic.
plot(frog_final_fit, which=1:6)
All diagnostic plots look the same.
Validate model using deviance lack of fit test
1-pchisq(frog_final_fit$deviance, frog_final_fit$df.residual)
## [1] 0.9999738
At 0.999, the observed deviance statistic is not extreme under the null hypothesis, and there is strong evidence that the fitted model is a good fit for the data.
Therefore, we can conclude that the detection function is the following…
log(odds-ratio) = beta_0_hat + beta_1_hat * distance + beta_2_hat * mic_id2 + beta_4_hat * mic_id4 + beta_5_hat * mic_id5 + beta_6_hat * mic_id6 + beta_7_hat * mic_id7
where mic_id is a categorical factor
where
Probability of Success = odds-ratio / (1 + odds-ratio)
exp(summary(frog_final_fit)$coeff["distance","Estimate"])
## [1] 0.2515001
Calculating the odds ratio shows that the probability of detection decreases by 26% for every meter further away the frog is from the microphone. Additionally for a frog at a fixed distance away from a microphone, the log probability increases or decreases depending on which microphone it is.
Going back onto our question, we want to be able to fit a model that estimates the effect of distance between a frog and a microphone on the probability that a call produced by the frog is detected by the microphone. This basically is a detection function. If we were to factor in the mic ID as an explanatory variable, this will highly depend on the placement for each mic, which means that the model generated from each mic for this dataset would be biased towards to this sample and not generalised for say, new sampled data. If we wanted to measure the detection for this particular sample, we keep the mic.id term, if we wanted a more generalised solution for further field use, we would drop the mic ID terms and just use distance to explain probability of detecting a frog call, this should give similar performance.